A classic result of Johnson and Lindenstrauss asserts that any set of n points in
d-dimensional Euclidean space can be embedded into k-dimensional Euclidean space,
where k is logarithmic in n and independent of d, so that all pairwise distances
are maintained within an arbitrarily small factor. All known constructions of such
embeddings involve projecting the points onto a spherically random k-dimensional
hyperplane through the origin. Here we give two novel constructions of such
embeddings, having the additional property that all elements of the projection matrix
belong in $\{-1, 0, +1\}$. This makes our constructions particularly well suited for
database environments, as the computation of the embedding reduces to evaluating
a single aggregate over k random partitions of the attributes.



1. INTRODUCTION 

Consider projecting the points of your favorite sculpture first onto the plane and then 
onto a single line. The result amply demonstrates the power of dimensionality. 

Conversely, given a high-dimensional pointset it is natural to ask whether it could 
be embedded into a lower dimensional space without suffering great distortion. In this 
paper, we will consider this question for finite sets of points in Euclidean space. It will 
be convenient to think of n points in $\mathbb{R}^d$ as an $n \times d$ table (matrix) A with each point
represented as a row (vector) with d attributes (coordinates). 

Given such a matrix representation of the pointset, one of the most commonly used 
embeddings is the one suggested by the Singular Value Decomposition of A. That is, in
order to embed the n points into $\mathbb{R}^k$ we project them onto the k-dimensional space spanned
by the singular vectors corresponding to the k largest singular values of A. If one rewrites
the result of this projection as a (rank k) $n \times d$ matrix $A_k$, we are guaranteed that for any
other k-dimensional pointset represented as an $n \times d$ matrix D,

$$|A - A_k|_F \le |A - D|_F ,$$

where, for any matrix Q, $|Q|_F^2 = \sum_{i,j} Q_{i,j}^2$. To interpret this result observe that $|\cdot|_F$ measures
the embedding's distortion as follows: for each point (row) consider the difference vector
between the original and the new position; the distortion is then the sum of the squared
lengths of all such vectors. Thus, if moving a point by z takes energy proportional to $z^2$,
then $A_k$ represents the k-dimensional configuration reachable from A with least energy.

It turns out that $A_k$ is also "optimal" under many other matrix norms. Specifically, it is
well-known that for any rank k matrix D and for any rotationally invariant norm

$$|A - A_k| \le |A - D| .$$

For each such norm, just as we did above for $|\cdot|_F$, one can give a natural interpretation
of how it measures the global distortion resulting from the embedding. At the same time, 
though, in all these cases there are no guarantees whatsoever regarding local properties of 
the resulting embedding. For example, it is not hard to devise examples where the new 
distance between a pair of points is arbitrarily smaller than the original distance. 

The lack of any such local guarantees makes it very hard to exploit such embeddings 
algorithmically. In a seminal paper [9], Linial, London and Rabinovich were the first to 
consider embeddings that respect local properties and their algorithmic applications. By
now, such embeddings have become an important tool in algorithmic design. 

A real gem in this area has been the following result of Johnson and Lindenstrauss [7]. 

Lemma 1.1 ([7]). Given $\epsilon > 0$ and an integer n, let k be a positive integer such that
$k \ge k_0 = O(\epsilon^{-2} \log n)$. For every set P of n points in $\mathbb{R}^d$ there exists $f : \mathbb{R}^d \to \mathbb{R}^k$ such
that for all $u, v \in P$

$$(1 - \epsilon)\|u - v\|^2 \le \|f(u) - f(v)\|^2 \le (1 + \epsilon)\|u - v\|^2 .$$

We will refer to embeddings providing a guarantee akin to that of Lemma 1.1 as 
JL-embeddings. In the last few years, such embeddings have been useful in solving a 
variety of problems. The rough idea is the following. By providing a low dimensional 
representation of the data, JL-embeddings speed up certain algorithms dramatically, in par- 
ticular algorithms whose run-time depends exponentially in the dimension of the working 
space (for a number of practical problems the best known algorithms indeed have such 
behavior). At the same time, the provided guarantee regarding pairwise distances often 
allows one to establish that the solution found by working in the low dimensional space is 
a good approximation to the solution in the original space. We give a few examples below. 

Papadimitriou, Raghavan, Tamaki and Vempala [10], proved that embedding the points 
of A in a low-dimensional space can significantly speed up the computation of a low 
rank approximation to A, without significantly affecting its quality. In [6], Indyk and 
Motwani showed that JL-embeddings are useful in solving the $\epsilon$-approximate nearest
neighbor problem, where (after some preprocessing of the pointset P) one is to answer
queries of the following type: "Given an arbitrary point x, find a point $y \in P$, such that for
every point $z \in P$, $\|x - z\| \ge (1 - \epsilon)\|x - y\|$." In a different vein, Schulman [11] used
JL-embeddings as part of an approximation algorithm for the version of clustering where 
we seek to minimize the sum of the squares of intracluster distances. Recently, Indyk [5] 
showed that JL-embeddings can also be used in the context of "data-stream" computation, 
where one has limited memory and is allowed only a single pass over the data (stream). 

1.1. Our contribution

Over the years, the probabilistic method has allowed for the original proof of Johnson 
and Lindenstrauss to be greatly simplified and sharpened, while at the same time giv- 
ing conceptually simple randomized algorithms for constructing the embedding [4, 6, 3]. 
Roughly speaking, all such algorithms project the input points onto a spherically random
hyperplane through the origin. While conceptually simple, in practice all such algorithms
amount to multiplying A with a dense matrix of real numbers. This can be a non-trivial task 
in many practical computational environments. Moreover, it is mathematically interesting 
to investigate the precise role of spherical symmetry in our choice of hyperplane. 

Our main result, below, asserts that one can replace projections onto random hyperplanes
with much simpler and faster operations. In particular, in a database environment these op- 
erations can be implemented readily using standard SQL primitives without any additional 
functionality. Somewhat surprisingly, we prove that this comes without any sacrifice in the 
quality of the embedding. In fact, we will see that for every fixed value of d we can get 
slightly better bounds than all current methods. 

We state our result below as Theorem 1.1. Following that, we discuss how to compute
the embedding in terms of database operations. As in Lemma 1.1, the parameter $\epsilon$ controls
the accuracy in distance preservation, while now $\beta$ controls the probability of success.

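Theorem 1.1. Let P be an arbitrary set of n points in $\mathbb{R}^d$, represented as an $n \times d$
matrix A. Given $\epsilon, \beta > 0$ let

$$k_0 = \frac{4 + 2\beta}{\epsilon^2/2 - \epsilon^3/3} \log n .$$

For integer $k \ge k_0$, let R be a $d \times k$ random matrix with $R(i,j) = r_{ij}$, where $\{r_{ij}\}$ are
independent random variables from either one of the following two probability distributions:

$$r_{ij} = \begin{cases} +1 & \text{with probability } 1/2 \\ -1 & \text{with probability } 1/2 \end{cases}$$

$$r_{ij} = \sqrt{3} \times \begin{cases} +1 & \text{with probability } 1/6 \\ \phantom{+}0 & \text{with probability } 2/3 \\ -1 & \text{with probability } 1/6 \end{cases}$$

Let

$$E = \frac{1}{\sqrt{k}} A R$$

and let $f : \mathbb{R}^d \to \mathbb{R}^k$ map the $i$th row of A to the $i$th row of E.
With probability at least $1 - n^{-\beta}$, for all $u, v \in P$

$$(1 - \epsilon)\|u - v\|^2 \le \|f(u) - f(v)\|^2 \le (1 + \epsilon)\|u - v\|^2 .$$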



We see that to construct a JL-embedding via Theorem 1.1, we need a very simple probability
distribution to generate the projection matrix, while the computation of the projection
itself reduces to aggregate evaluation. Moreover, when $r_{ij} \in \{-1, +1\}$, the construction
is also conceptually extremely simple. On the other hand, when $r_{ij} \in \sqrt{3} \times \{-1, 0, +1\}$, we get
a threefold speedup, as we only need to process a third of all attributes for each of the k
coordinates.
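
As an illustration of Theorem 1.1, the following minimal Python sketch samples R from either distribution and computes $E = \frac{1}{\sqrt{k}} A R$ for a small synthetic pointset. The helper names (`min_dimension`, `random_matrix`, `embed`, `max_distortion`), the Gaussian test data, and the fixed seed are assumptions made for the example; they are not part of the theorem.

```python
import math
import random
from itertools import combinations

def min_dimension(n, eps, beta):
    """k0 from Theorem 1.1, rounded up to an integer."""
    return math.ceil((4 + 2 * beta) / (eps**2 / 2 - eps**3 / 3) * math.log(n))

def random_matrix(d, k, sparse, rng):
    """d x k matrix with i.i.d. entries from one of the two distributions."""
    if sparse:
        # sqrt(3) * {+1, 0, -1} with probabilities 1/6, 2/3, 1/6
        values, weights = (math.sqrt(3), 0.0, -math.sqrt(3)), (1, 4, 1)
    else:
        # {+1, -1} with probability 1/2 each
        values, weights = (1.0, -1.0), (1, 1)
    return [rng.choices(values, weights=weights, k=k) for _ in range(d)]

def embed(A, R):
    """E = (1 / sqrt(k)) * A * R, skipping the zero entries of R."""
    k = len(R[0])
    scale = 1 / math.sqrt(k)
    return [[scale * sum(x * R[i][j] for i, x in enumerate(row) if R[i][j])
             for j in range(k)] for row in A]

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def max_distortion(A, E):
    """Largest relative error of a squared pairwise distance."""
    return max(abs(sq_dist(E[i], E[j]) / sq_dist(A[i], A[j]) - 1)
               for i, j in combinations(range(len(A)), 2))

rng = random.Random(0)                      # fixed seed, for illustration only
n, d, eps, beta = 8, 50, 0.5, 2.0
k = min_dimension(n, eps, beta)
A = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]   # synthetic points
distortions = {sparse: max_distortion(A, embed(A, random_matrix(d, k, sparse, rng)))
               for sparse in (False, True)}
```

With n = 8, $\epsilon = 1/2$ and $\beta = 2$ the formula gives k = 200. The dictionary `distortions` then holds, for each distribution, the largest relative error over all pairs, which Theorem 1.1 keeps below $\epsilon$ with probability at least $1 - n^{-\beta}$.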

Database-friendliness. To apply the theorem in a database system using, say, the second 
distribution above one needs to generate k new attributes, each one formed by performing 
the same random experiment: throw away 2/3 of the original attributes at random; partition 
the remaining attributes randomly into two equal parts; for each partition, produce a new 
attribute equal to the sum of all attributes; take the difference of the two sum-attributes. 
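
The random experiment just described can be sketched as follows. This is only an illustration: the function names and the toy table are hypothetical, each attribute is assigned independently with probabilities 1/6, 1/6 and 2/3 as in the second distribution of Theorem 1.1 (so the two parts are equal in size only in expectation), and the final scaling by $\sqrt{3/k}$ is left out because it can be applied after the aggregates are computed.

```python
import random

def random_partition(d, rng):
    """Split attribute indices 0..d-1 into a 'plus' part and a 'minus' part."""
    plus, minus = [], []
    for i in range(d):
        u = rng.random()
        if u < 1 / 6:
            plus.append(i)
        elif u < 1 / 3:
            minus.append(i)
        # with the remaining probability 2/3 attribute i is thrown away
    return plus, minus

def aggregate(row, plus, minus):
    """Difference of the two sum-attributes for one row."""
    return sum(row[i] for i in plus) - sum(row[i] for i in minus)

rng = random.Random(1)                                    # fixed seed, illustration only
table = [[3, 1, 4, 1, 5, 9], [2, 6, 5, 3, 5, 8]]          # toy table: 2 rows, d = 6
partitions = [random_partition(6, rng) for _ in range(4)]  # k = 4 new attributes
new_table = [[aggregate(row, p, m) for p, m in partitions] for row in table]
```

The same k partitions are reused for every row, so each new attribute is a single sum-and-difference aggregate over the original columns.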

Projecting onto random lines. Looking a bit more closely into the computation of
the embedding we see that each row (vector) of A is projected onto k random vectors
whose coordinates $\{r_{ij}\}$ are independent random variables with mean 0 and variance 1.
If the $\{r_{ij}\}$ were independent Normal random variables with mean 0 and variance 1, it is
well-known that each resulting vector would point to a uniformly random direction in space.
Projections onto such vectors have been considered in a number of settings, including 
the work of Kleinberg on approximate nearest neighbors [8] and of Vempala on learning 
intersections of halfspaces [12]. More recently, such projections have also been used in 
learning mixture of Gaussians models, starting with the work of Dasgupta [2] and later 
with the work of Arora and Kannan [1]. 

Our proof implies that for any fixed vector a, the behavior of its projection onto a random
vector c is mandated by the even moments of the random variable $\|a \cdot c\|$. In fact, our result
follows by showing that for every vector a, under our distributions for $\{r_{ij}\}$, these moments
are dominated by the corresponding moments for the case where c is spherically symmetric.
As a result, projecting onto vectors whose entries are distributed like the columns of matrix 
R could replace projection onto spherically random vectors; it is computationally simpler 
and results in projections that are at least as nicely behaved. 

Randomization. Perhaps a naive attempt at constructing JL-embeddings would be to 
pick k of the original coordinates in d-dimensional space as the new coordinates. Naturally, 
as two points can be very far apart while differing only along a single original dimension, 
this approach is doomed. At the same time, though, if for each pair of points, all coordinates 
contributed "roughly equally" to their distance, then a sampling scheme as above would
make a lot of sense. Thus, it is very natural to first apply a random rotation to the original
pointset in $\mathbb{R}^d$ and then, say, pick the first k of the resulting coordinates as our new
coordinates. Of course, this is exactly the same as projecting onto a spherically random
k-dimensional hyperplane! The random rotation can be viewed as a form of insurance,
similar to the random permutation usually applied before applying Quicksort. 

Derandomization. Finally, we note that Theorem 1.1 allows one to use significantly 
fewer random bits than all previous methods for constructing JL-embeddings. While the 
amount of randomness needed is still quite large, such attempts for randomness reduction 
are of independent interest and our result can be viewed as a first step in that direction. 

2. PREVIOUS WORK 

As we will see, in all methods for producing JL-embeddings, including ours, the heart 
of the matter is showing that for any vector, the squared length of its projection is sharply 
concentrated around its expected value. The original proof of Johnson and Lindenstrauss [7] 
uses quite heavy geometric approximation machinery to yield such a concentration bound. 
That proof was greatly simplified and sharpened by Frankl and Maehara [4] who explicitly
considered a projection onto k random orthonormal vectors (as opposed to viewing such 
vectors as the basis of a random hyperplane), yielding the following result. 

Theorem 2.1 ([4]). For any $\epsilon \in (0, 1/2)$, any sufficiently large set $P \subset \mathbb{R}^d$, and
$k \ge k_0 = \lceil 9(\epsilon^2 - 2\epsilon^3/3)^{-1} \log |P| \rceil + 1$, there exists a map $f : P \to \mathbb{R}^k$ such that for all
$u, v \in P$

$$(1 - \epsilon)\|u - v\|^2 \le \|f(u) - f(v)\|^2 \le (1 + \epsilon)\|u - v\|^2 .$$

The next great simplification of the proof of Lemma 1.1 was given, independently, by
Indyk and Motwani [6] and Dasgupta and Gupta [3], the latter also giving a slight sharpening
of the bound for $k_0$. By combining the analysis of [3] with the viewpoint of [6] it is in fact
not hard to show that Theorem 1.1 holds if for all i, j, $r_{ij} \stackrel{D}{=} N(0, 1)$. Below we state our
rendition of how each of these simplifications was achieved, as it prepares the ground for
our own work. Let us write $X \stackrel{D}{=} Y$ to denote that X is distributed as Y and recall that
N(0, 1) denotes the standard Normal random variable having mean 0 and variance 1.

[6]: Assume that we try to implement the scheme of Frankl and Maehara [4] but we are
lazy about enforcing either normality (unit length) or orthogonality among our k vectors.
Instead, we just pick our k vectors independently, in a spherically symmetric manner, by
taking as the coordinates of each vector independent N(0, 1) random variables and then
merely scaling each vector by $1/\sqrt{d}$ so that its expected length is 1.

An immediate gain of this approach is that now, for any fixed vector a, the length of
its projection onto each of our vectors is also a Normal random variable. This is due to a
powerful and deep fact, namely the 2-stability of the Gaussian distribution: for any real
numbers $a_1, a_2, \ldots, a_d$, if $\{Z_i\}_{i=1}^d$ is a family of independent Normal random variables
and $X = \sum_{i=1}^d a_i Z_i$, then $X \stackrel{D}{=} c Z_1$, where $c = (a_1^2 + \cdots + a_d^2)^{1/2}$. As a result, if
we take these k projection lengths to be the coordinates of the embedded vector in $\mathbb{R}^k$, then
the squared length of the embedded vector follows the Chi-square distribution for which
strong concentration bounds are readily available.
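
The 2-stability property can be probed empirically. The sketch below is a sanity check under assumed values (the vector `a`, the sample size, and the seed are arbitrary choices), not a proof: it draws samples of $X = \sum_i a_i Z_i$ with standard Normal $Z_i$ and compares the second and fourth sample moments with those of $c Z_1$, namely $c^2$ and $3c^4$.

```python
import math
import random

rng = random.Random(0)                       # fixed seed, illustration only
a = [3.0, -1.0, 2.0, 0.5]                    # arbitrary coefficients
c = math.sqrt(sum(x * x for x in a))         # c = (a_1^2 + ... + a_d^2)^(1/2)

# Draw samples of X = sum_i a_i * Z_i with independent standard Normal Z_i.
samples = [sum(x * rng.gauss(0, 1) for x in a) for _ in range(20000)]
second = sum(s ** 2 for s in samples) / len(samples)   # estimate of E(X^2)
fourth = sum(s ** 4 for s in samples) / len(samples)   # estimate of E(X^4)

# If X is distributed as c * Z_1, then E(X^2) = c^2 and E(X^4) = 3 * c^4.
ratio_second = second / c ** 2
ratio_fourth = fourth / (3 * c ** 4)
```

Both ratios should be close to 1, up to sampling error.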

Remarkably, very little is lost due to our laziness. Although we did not explicitly enforce
either orthogonality or normality, the resulting k vectors, with high probability, will come
very close to having both of these properties. In particular, the length of each of the k
vectors is sharply concentrated (around 1) as the sum of d independent random variables.
Moreover, since the k vectors point in uniformly random directions in $\mathbb{R}^d$, they get rapidly
closer to being orthogonal as d grows.

[3]: Here we will exploit spherical symmetry without appealing directly to the 2-stability 
of the Gaussian distribution. Instead observe that, by symmetry, the projection of any unit 
vector a on a random hyperplane through the origin is distributed exactly like the projection 
of a random point from the surface of the d-dimensional sphere onto a fixed subspace of 
dimension k. Such a projection can be studied readily, though, as now each coordinate is a 
scaled Normal random variable. With a somewhat tighter analysis than [6], this approach
gave the strongest known bound, namely $k \ge k_0 = (4 + 2\beta)(\epsilon^2/2 - \epsilon^3/3)^{-1} \log n$, which
is exactly the same as the bound in Theorem 1.1.

3. SOME INTUITION 

Our contribution begins with the realization that spherical symmetry, while making life 
extremely comfortable, is not essential. What is essential is concentration. So, at least in 
principle, one is free to consider other candidate distributions for the $\{r_{ij}\}$, if perhaps at
the expense of comfort.

As we saw earlier, each column of our matrix R will give us a coordinate of the projection
in $\mathbb{R}^k$ and thus the squared length of the projection is merely the sum of the squares of these
coordinates. So, effectively, the projection is equivalent to the following: each column
acts as an independent estimator of the original vector's length (its estimate being the inner
product with it) and in the end we take the consensus estimate (sum) of our k estimators.

Seen from this angle, requiring the k vectors to be orthonormal has the pleasant statistical
overtone of "maximizing mutual information" (since all estimators have equal weight and
are orthogonal). Nonetheless, even if we only require that each column simply gives an
unbiased, bounded variance estimator, the Central Limit Theorem implies that if we take
sufficiently many columns, we can get an arbitrarily good estimate of the original length.
Naturally, the number of columns needed depends on the variance of the estimators.

From the above we see that the key issue is the concentration of the projection of 
an arbitrary fixed vector a onto a single random vector. The main technical difficulty, 
resulting from giving up spherical symmetry, is that this concentration can depend on a. 
Our technical contribution lies in determining probability distributions for the $\{r_{ij}\}$ under
which, for all vectors, this concentration is at least as good as in the spherically symmetric 
case. In fact, it will turn out that for every fixed value of d, we can get a (minuscule) 
improvement of concentration. Thus, for every fixed d, we can actually get a strictly better 
bound for k, albeit marginally, than by taking spherically random vectors. 

The reader might be wondering "how can it be that perfect spherical symmetry does not
buy us anything (and is in fact slightly worse for each fixed d)?". To at least show why
we don't lose too much by giving up spherical symmetry, we have the following intuitive
argument. When all vectors are not equal with respect to the variability of the length of
their projection, an adversary could try to pick a worst-case such vector w. So, we can
rephrase the question as "How much are we empowering the adversary by committing to
picking our column vectors among lattice points rather than arbitrary points in $\mathbb{R}^d$?".

As we will see, and this lies at the heart of our proof, the worst-case vector w is
$\frac{1}{\sqrt{d}}(1, \ldots, 1)$ (along with all $2^d$ vectors resulting by sign-flipping w's coordinates). So,
the worst-case vector, at least in terms of the magnitudes of its coordinates, turns out to be 
a more or less "typical" vector, unlike say (1, 0, . . . , 0). Therefore, it is not hard to believe 
that the adversary would not fare much worse by replacing w with a spherically random 
vector. In that case, though, the adversary does not benefit at all from our commitment! 

To get a more satisfactory answer, it seems like one has to delve into the proof. In
particular, both for the spherically random case and for our distributions, the bound on k is
mandated by the probability of overestimating the projected length. Thus, the "bad events"
amount to the spanning vectors being too "well-aligned" with a. Now, in the spherically
symmetric setting it is possible to have alignment that is arbitrarily close to perfect, albeit
with correspondingly smaller probability. In our case, if we don't have perfect alignment
then we are guaranteed a certain, bounded amount of misalignment. It is precisely this
tradeoff between the probability and the extent of alignment that drives the proof.

Consider, for example, the case when d = 2 with $r_{ij} \in \{-1, +1\}$. As we said above, the
worst case vector is $w = (1/\sqrt{2})(1, 1)$. So, with probability 1/2 we have perfect alignment
(when our random vector is $\pm w$) and with probability 1/2 we have orthogonality. On the
other hand, for the spherically symmetric case, we have to consider the integral over all 
points on the plane, weighted by their probability under the two-dimensional Gaussian 
distribution. It's a rather instructive exercise to explore this tradeoff directly and might also 
give the interested reader some intuition for the general case. 
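
For the $\{-1, +1\}$ side of this exercise the computation is small enough to enumerate exactly. The sketch below uses the scaling of Section 4.1, under which the random vector is $c = (1/\sqrt{2})(r_1, r_2)$ and $Q = w \cdot c = (r_1 + r_2)/2$. The variable names are illustrative, and the Gaussian value $3/d^2$ is the fourth moment of $N(0, 1/d)$.

```python
from fractions import Fraction
from itertools import product

# d = 2, w = (1/sqrt(2)) (1, 1), c = (1/sqrt(2)) (r1, r2), hence Q = w . c = (r1 + r2) / 2.
outcomes = [Fraction(r1 + r2, 2) for r1, r2 in product((-1, 1), repeat=2)]

prob_aligned = Fraction(sum(1 for q in outcomes if abs(q) == 1), len(outcomes))
prob_orthogonal = Fraction(sum(1 for q in outcomes if q == 0), len(outcomes))
second_moment = sum(q ** 2 for q in outcomes) / len(outcomes)   # E(Q^2)
fourth_moment = sum(q ** 4 for q in outcomes) / len(outcomes)   # E(Q^4)
gaussian_fourth = Fraction(3, 2 ** 2)    # E(T^4) = 3/d^2 for T ~ N(0, 1/d), d = 2
```

Perfect alignment and orthogonality each occur with probability 1/2, the second moment is 1/d = 1/2 in both settings, and the fourth moment 1/2 stays below the Gaussian value 3/4.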

4. PRELIMINARIES AND THE SPHERICALLY SYMMETRIC CASE
4.1. Preliminaries

Let $x \cdot y$ denote the inner product of vectors x, y. To simplify notation in the calculations
we will work with matrix R scaled by $1/\sqrt{d}$ and, as a result, to get E we need to scale
$A \times R$ by $\sqrt{d/k}$ rather than $1/\sqrt{k}$. So, R is a random $d \times k$ matrix with $R(i,j) = r_{ij}/\sqrt{d}$,
where the $\{r_{ij}\}$ are distributed as in Theorem 1.1. Therefore, if $c_j$ denotes the $j$th column
of R, then $\{c_j\}_{j=1}^k$ is a family of k i.i.d. random unit vectors in $\mathbb{R}^d$ and for all $a \in \mathbb{R}^d$,

$$f(a) = \sqrt{d/k}\,(a \cdot c_1, \ldots, a \cdot c_k) .$$

In practice, of course, such scaling can be postponed until after the matrix multiplica- 
tion (projection) has been performed, so that we maintain the advantage of only having 
{-1,0, +1} in the projection matrix. 

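Let us start by computing $E(\|f(a)\|^2)$ for an arbitrary vector $a \in \mathbb{R}^d$. Let $\{Q_j\}_{j=1}^k$ be
defined as

$$Q_j = a \cdot c_j .$$

Then

$$E(Q_j) = E\Big(\frac{1}{\sqrt{d}} \sum_{i=1}^d a_i r_{ij}\Big) = \frac{1}{\sqrt{d}} \sum_{i=1}^d a_i E(r_{ij}) = 0 , \tag{1}$$

and

$$E(Q_j^2) = E\Big(\Big(\frac{1}{\sqrt{d}} \sum_{i=1}^d a_i r_{ij}\Big)^2\Big)$$
$$= \frac{1}{d} E\Big(\sum_{i=1}^d (a_i r_{ij})^2 + \sum_{l=1}^d \sum_{m=l+1}^d 2 a_l a_m r_{lj} r_{mj}\Big)$$
$$= \frac{1}{d} \sum_{i=1}^d a_i^2 E(r_{ij}^2) + \frac{1}{d} \sum_{l=1}^d \sum_{m=l+1}^d 2 a_l a_m E(r_{lj}) E(r_{mj})$$
$$= \frac{1}{d} \times \|a\|^2 . \tag{2}$$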



Note that to get (1) and (2) we only used that the $\{r_{ij}\}$ are independent, $E(r_{ij}) = 0$ and
$\mathrm{Var}(r_{ij}) = 1$. Now from (2) we see that

$$E(\|f(a)\|^2) = E\big(\|\sqrt{d/k}\,(a \cdot c_1, \ldots, a \cdot c_k)\|^2\big) = \frac{d}{k} \sum_{j=1}^k E(Q_j^2) = \|a\|^2 .$$

That is, for any independent family of $\{r_{ij}\}$ with $E(r_{ij}) = 0$ and $\mathrm{Var}(r_{ij}) = 1$ we get
an unbiased estimator, i.e., $E(\|f(a)\|^2) = \|a\|^2$. Note now that in order to have a
JL-embedding we need that for each of the $\binom{n}{2}$ pairs $u, v \in P$, the squared norm of the
vector $u - v$ is maintained within a factor of $1 \pm \epsilon$. Therefore, if for some such family of
$\{r_{ij}\}$ we can further prove that for some $\beta > 0$ and any fixed vector $a \in \mathbb{R}^d$,

$$\Pr\big[(1 - \epsilon)\|a\|^2 \le \|f(a)\|^2 \le (1 + \epsilon)\|a\|^2\big] \ge 1 - \frac{2}{n^{2+\beta}} , \tag{3}$$

then the probability of not getting a JL-embedding is bounded by $\binom{n}{2} \times 2/n^{2+\beta} < 1/n^{\beta}$.
Thus, our entire task has been reduced to determining a zero mean, unit variance distribution
for the $\{r_{ij}\}$ such that (3) holds for any fixed vector a. In fact, since for any fixed projection
matrix, $\|f(a)\|^2$ is proportional to $\|a\|^2$, it suffices to prove that (3) holds for arbitrary unit
vectors. Finally, observe that since $E(\|f(a)\|^2) = \|a\|^2$, inequality (3) merely asserts that
the random variable $\|f(a)\|^2$ is concentrated around its expectation.
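
Identities (1) and (2) can be confirmed by exact enumeration for small d. The sketch below is only a check on one arbitrarily chosen vector. The dictionary `DISTRIBUTIONS` and the function name are illustrative, and each entry $r$ is represented as $\sqrt{\text{scale2}}$ times an integer value so that all arithmetic stays rational.

```python
from fractions import Fraction
from itertools import product

# Each distribution as (scale2, [(value, probability), ...]) with r = sqrt(scale2) * value.
DISTRIBUTIONS = {
    "dense": (1, [(1, Fraction(1, 2)), (-1, Fraction(1, 2))]),
    "sparse": (3, [(1, Fraction(1, 6)), (0, Fraction(2, 3)), (-1, Fraction(1, 6))]),
}

def first_two_moments(a, name):
    """Return (E(X), E(Q^2)) exactly, where X = sum_i a_i * value_i and
    Q^2 = scale2 * X^2 / d, so that E(Q) = 0 exactly when E(X) = 0."""
    scale2, support = DISTRIBUTIONS[name]
    d = len(a)
    mean_x = second = Fraction(0)
    for combo in product(support, repeat=d):
        prob = Fraction(1)
        x = Fraction(0)
        for a_i, (value, p) in zip(a, combo):
            prob *= p
            x += a_i * value
        mean_x += prob * x
        second += prob * scale2 * x * x / d
    return mean_x, second

a = [Fraction(2), Fraction(-1), Fraction(3), Fraction(1, 2)]   # arbitrary vector, d = 4
norm2 = sum(x * x for x in a)                                  # ||a||^2
results = {name: first_two_moments(a, name) for name in DISTRIBUTIONS}
```

For both distributions the first component is 0 and the second equals $\|a\|^2/d$, in agreement with (1) and (2).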

4.2. The spherically symmetric case 
As a warm up for proving concentration for our distributions for the $\{r_{ij}\}$, let us first
wrap up the spherically random case. Getting a concentration inequality for $\|f(a)\|^2$ when
$r_{ij} \stackrel{D}{=} N(0, 1)$ is straightforward. Due to the 2-stability of the Normal distribution, for
every unit vector a, we have $\|f(a)\|^2 \stackrel{D}{=} \chi^2(k)/k$, where $\chi^2(k)$ denotes the Chi-square
distribution with k degrees of freedom. The fact that we get the same distribution for 
every vector a corresponds to the intuition that "all vectors are the same" with respect 
to projection onto a spherically random vector. Standard tail-bounds for the Chi-square 
distribution readily yield the following. 

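Lemma 4.1. Let $r_{ij} \stackrel{D}{=} N(0, 1)$ for all i, j. Then, for any $\epsilon > 0$ and any unit vector
$a \in \mathbb{R}^d$,

$$\Pr[\|f(a)\|^2 > 1 + \epsilon] < \exp\Big(-\frac{k}{2}(\epsilon^2/2 - \epsilon^3/3)\Big) ,$$
$$\Pr[\|f(a)\|^2 < 1 - \epsilon] < \exp\Big(-\frac{k}{2}(\epsilon^2/2 - \epsilon^3/3)\Big) .$$

Thus, to get a JL-embedding we need only require

$$2 \times \exp\Big(-\frac{k}{2}(\epsilon^2/2 - \epsilon^3/3)\Big) \le \frac{2}{n^{2+\beta}} ,$$

which holds for

$$k \ge \frac{4 + 2\beta}{\epsilon^2/2 - \epsilon^3/3} \log n .$$
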
Let us note that the bound on the upper tail of $\|f(a)\|^2$ above is tight (up to lower order
terms). As a result, as long as the union bound is used, one cannot hope for a better bound 
on k while using spherically random vectors. 
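
The arithmetic linking the tail bound, the value of k, and the union bound can be checked numerically. In the sketch below the parameter values are arbitrary and the function names are illustrative: at $k = \frac{4 + 2\beta}{\epsilon^2/2 - \epsilon^3/3} \log n$ each tail bound equals $n^{-(2+\beta)}$, and summing over the $\binom{n}{2}$ pairs stays below $n^{-\beta}$.

```python
import math

def tail_bound(k, eps):
    """One-sided bound exp(-(k/2)(eps^2/2 - eps^3/3)) from Lemma 4.1."""
    return math.exp(-k / 2 * (eps**2 / 2 - eps**3 / 3))

def k0(n, eps, beta):
    """Smallest (real-valued) k for which both tails together are at most 2/n^(2+beta)."""
    return (4 + 2 * beta) / (eps**2 / 2 - eps**3 / 3) * math.log(n)

n, eps, beta = 1000, 0.2, 1.0                 # arbitrary illustrative values
per_vector_failure = 2 * tail_bound(k0(n, eps, beta), eps)   # both tails
pairs = n * (n - 1) // 2                                      # number of pairs u, v
total_failure = pairs * per_vector_failure                    # union bound
```

Here `per_vector_failure` equals $2/n^{2+\beta}$ up to rounding, and `total_failure` is just under $n^{-\beta}$.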

To prove our result we will use the exact same approach, arguing that for every unit vector
$a \in \mathbb{R}^d$, the random variable $\|f(a)\|^2$ is sharply concentrated around its expectation. In
the next section we state a lemma analogous to Lemma 4.1 above and show how it follows
from bounds on certain moments of $Q_1$. We prove those bounds in Section 6.

5. TAIL BOUNDS 

To simplify notation let us define for an arbitrary vector a,

$$S = S(a) = \sum_{j=1}^k (a \cdot c_j)^2 = \sum_{j=1}^k Q_j^2(a) ,$$

where $c_j$ is the $j$th column of R, so that $\|f(a)\|^2 = S \times d/k$.

Lemma 5.1. Let the $r_{ij}$ have any one of the two distributions in Theorem 1.1. Then, for
any $\epsilon > 0$ and any unit vector $a \in \mathbb{R}^d$,

$$\Pr[S(a) > (1 + \epsilon)k/d] < \exp\Big(-\frac{k}{2}(\epsilon^2/2 - \epsilon^3/3)\Big) ,$$
$$\Pr[S(a) < (1 - \epsilon)k/d] < \exp\Big(-\frac{k}{2}(\epsilon^2/2 - \epsilon^3/3)\Big) .$$



In proving Lemma 5.1 we will generally omit the dependence of probabilities on a, 
making it explicit only when it affects our calculations. We will use the standard technique 
of applying Markov's inequality to the moment generating function of S, thus reducing the
proof of the lemma to bounding certain moments of $Q_1$. In particular, we will need the
following lemma which will be proved in Section 6.



Lemma 5.2. For all $h \in [0, d/2)$, all $d \ge 1$ and all unit vectors a,

$$E(\exp(h Q_1(a)^2)) \le \frac{1}{\sqrt{1 - 2h/d}} , \tag{4}$$
$$E(Q_1(a)^4) \le \frac{3}{d^2} . \tag{5}$$


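Proof of Lemma 5.1. We start with the upper tail. For arbitrary $h > 0$ let us write

$$\Pr\Big[S > (1 + \epsilon)\frac{k}{d}\Big] = \Pr\Big[\exp(hS) > \exp\Big(h(1 + \epsilon)\frac{k}{d}\Big)\Big]$$
$$< E(\exp(hS)) \exp\Big(-h(1 + \epsilon)\frac{k}{d}\Big) .$$

Since the $\{Q_j\}_{j=1}^k$ are i.i.d. we have

$$E(\exp(hS)) = E\Big(\prod_{j=1}^k \exp(h Q_j^2)\Big) \tag{6}$$
$$= \prod_{j=1}^k E(\exp(h Q_j^2)) \tag{7}$$
$$= (E(\exp(h Q_1^2)))^k , \tag{8}$$

where passing from (6) to (7) uses that the $\{Q_j\}_{j=1}^k$ are independent, while passing from
(7) to (8) uses that they are identically distributed. Thus, for any $\epsilon > 0$

$$\Pr\Big[S > (1 + \epsilon)\frac{k}{d}\Big] < (E(\exp(h Q_1^2)))^k \exp\Big(-h(1 + \epsilon)\frac{k}{d}\Big) . \tag{9}$$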



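Substituting (4) in (9) we get (10). To optimize the bound we set the derivative in (10)
with respect to h to 0. This gives $h = \frac{d}{2} \frac{\epsilon}{1 + \epsilon} < \frac{d}{2}$. Substituting this value of h we get (11)
and series expansion yields (12).

$$\Pr\Big[S > (1 + \epsilon)\frac{k}{d}\Big] \le \Big(\frac{1}{\sqrt{1 - 2h/d}}\Big)^k \exp\Big(-h(1 + \epsilon)\frac{k}{d}\Big) \tag{10}$$
$$= ((1 + \epsilon)\exp(-\epsilon))^{k/2} \tag{11}$$
$$< \exp\Big(-\frac{k}{2}(\epsilon^2/2 - \epsilon^3/3)\Big) . \tag{12}$$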
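
The optimization over h can be double-checked numerically. In the sketch below (illustrative parameter values and names), the bound $(1 - 2h/d)^{-k/2} \exp(-h(1+\epsilon)k/d)$ obtained by substituting (4) in (9) is evaluated on a grid of h in $(0, d/2)$ and compared with its value at $h = \frac{d}{2}\frac{\epsilon}{1+\epsilon}$, which is $((1+\epsilon)e^{-\epsilon})^{k/2}$.

```python
import math

def upper_bound(h, eps, k, d):
    """(1 - 2h/d)^(-k/2) * exp(-h (1 + eps) k / d), for h in (0, d/2)."""
    return (1 - 2 * h / d) ** (-k / 2) * math.exp(-h * (1 + eps) * k / d)

eps, k, d = 0.3, 40, 10                                 # arbitrary illustrative values
h_star = d / 2 * eps / (1 + eps)                        # stationary point of the bound
grid = [d / 2 * t / 1000 for t in range(1, 1000)]       # h values inside (0, d/2)
best_on_grid = min(upper_bound(h, eps, k, d) for h in grid)

closed_form = ((1 + eps) * math.exp(-eps)) ** (k / 2)   # value of the bound at h_star
series_bound = math.exp(-k / 2 * (eps**2 / 2 - eps**3 / 3))
```

No grid point beats `h_star`, the value there matches the closed form, and the closed form lies below the final exponential bound.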



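Similarly, but now considering $\exp(-hS)$ for arbitrary $h > 0$, we get that for any $\epsilon > 0$

$$\Pr\Big[S < (1 - \epsilon)\frac{k}{d}\Big] < (E(\exp(-h Q_1^2)))^k \exp\Big(h(1 - \epsilon)\frac{k}{d}\Big) . \tag{13}$$

Rather than bounding $E(\exp(-h Q_1^2))$ directly, let us expand $\exp(-h Q_1^2)$ to get

$$\Pr\Big[S < (1 - \epsilon)\frac{k}{d}\Big] \le \Big(E\Big(1 - h Q_1^2 + \frac{(-h Q_1^2)^2}{2!}\Big)\Big)^k \exp\Big(h(1 - \epsilon)\frac{k}{d}\Big)$$
$$= \Big(1 - \frac{h}{d} + \frac{h^2}{2} E(Q_1^4)\Big)^k \exp\Big(h(1 - \epsilon)\frac{k}{d}\Big) , \tag{14}$$

where $E(Q_1^2)$ was given by (2).

Now, substituting (5) in (14) we get (15). This time taking $h = \frac{d}{2} \frac{\epsilon}{1 + \epsilon}$ is not optimal but
is still "good enough", giving (16). Again, series expansion yields (17).

$$\Pr\Big[S < (1 - \epsilon)\frac{k}{d}\Big] \le \Big(1 - \frac{h}{d} + \frac{3h^2}{2d^2}\Big)^k \exp\Big(h(1 - \epsilon)\frac{k}{d}\Big) \tag{15}$$
$$= \Big(1 - \frac{\epsilon}{2(1 + \epsilon)} + \frac{3\epsilon^2}{8(1 + \epsilon)^2}\Big)^k \exp\Big(\frac{\epsilon(1 - \epsilon)k}{2(1 + \epsilon)}\Big) \tag{16}$$
$$< \exp\Big(-\frac{k}{2}(\epsilon^2/2 - \epsilon^3/3)\Big) . \tag{17}$$

□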
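
The last step of the lower-tail argument can be checked numerically. The sketch below assumes that h is set to $\frac{d}{2}\frac{\epsilon}{1+\epsilon}$, the same value as for the upper tail, so that $h/d = \frac{\epsilon}{2(1+\epsilon)}$. It compares the logarithm (per unit of k) of the resulting bound $(1 - h/d + 3h^2/(2d^2))^k \exp(h(1-\epsilon)k/d)$ with that of $\exp(-\frac{k}{2}(\epsilon^2/2 - \epsilon^3/3))$ on a grid of $\epsilon$ in $(0, 1]$. The function names and the grid are illustrative.

```python
import math

def log_lower_bound(eps):
    """(1/k) * log of (1 - x + 1.5 x^2)^k * exp(x (1 - eps) k), with x = h/d."""
    x = eps / (2 * (1 + eps))          # assumed choice h = (d/2) * eps / (1 + eps)
    return math.log(1 - x + 1.5 * x * x) + x * (1 - eps)

def log_target(eps):
    """(1/k) * log of exp(-(k/2)(eps^2/2 - eps^3/3))."""
    return -(eps**2 / 2 - eps**3 / 3) / 2

grid = [t / 100 for t in range(1, 101)]            # eps = 0.01, 0.02, ..., 1.00
gaps = [log_target(eps) - log_lower_bound(eps) for eps in grid]
```

Every entry of `gaps` is positive on this grid. For small $\epsilon$ the gap is of order $\epsilon^3$, which reflects that this choice of h is not optimal but suffices.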



6. MOMENT BOUNDS 

To simplify notation in this section we will drop the subscript and refer to $Q_1$ as Q. It
should be clear that the distribution of Q depends on a, i.e., $Q = Q(a)$. This is precisely
what we give up by not projecting onto spherically symmetric vectors. Our strategy for
giving bounds on the moments of Q will be to determine a "worst case" unit vector w and
bound the moments of $Q(w)$. We claim the following.



Lemma 6.1. Let

$$w = \frac{1}{\sqrt{d}}(1, \ldots, 1) .$$

For every unit vector $a \in \mathbb{R}^d$, and for all $k = 0, 1, \ldots$,

$$E(Q(a)^{2k}) \le E(Q(w)^{2k}) . \tag{18}$$

We will also prove that the even moments of $Q(w)$ are dominated by the corresponding
moments from the spherically symmetric case. That is,

Lemma 6.2. Let

$$T \stackrel{D}{=} N(0, 1/d) .$$

For all $d \ge 1$ and all $k = 0, 1, \ldots$,

$$E(Q(w)^{2k}) \le E(T^{2k}) . \tag{19}$$
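
For small d both moment inequalities can be checked by exact enumeration. The sketch below does so for d = 3 and one rational unit vector. The names are illustrative, entries are represented as $r = \sqrt{\text{scale2}} \times$ value so that even powers stay rational, and the Gaussian moment used is $E(T^{2k}) = (2k-1)!!/d^k$ for $T \stackrel{D}{=} N(0, 1/d)$. It is a spot check, not a substitute for the proofs.

```python
from fractions import Fraction
from itertools import product
from math import prod

# r = sqrt(scale2) * value, so that r^(2k) = scale2^k * value^(2k) is rational.
DISTRIBUTIONS = {
    "dense": (1, [(1, Fraction(1, 2)), (-1, Fraction(1, 2))]),
    "sparse": (3, [(1, Fraction(1, 6)), (0, Fraction(2, 3)), (-1, Fraction(1, 6))]),
}

def moment(coeffs, k, name):
    """E((sum_i coeffs_i * r_i)^(2k)), computed exactly by enumeration."""
    scale2, support = DISTRIBUTIONS[name]
    total = Fraction(0)
    for combo in product(support, repeat=len(coeffs)):
        prob = prod((p for _, p in combo), start=Fraction(1))
        x = sum(c * v for c, (v, _) in zip(coeffs, combo))
        total += prob * Fraction(x) ** (2 * k)
    return scale2 ** k * total

d = 3
a = [Fraction(3, 5), Fraction(4, 5), Fraction(0)]        # a rational unit vector
rows = []
for name in DISTRIBUTIONS:
    for k in range(5):
        q_a = moment(a, k, name) / d ** k                 # E(Q(a)^(2k)), Q = (1/sqrt(d)) a . r
        q_w = moment([1] * d, k, name) / d ** (2 * k)     # E(Q(w)^(2k)), w = (1,...,1)/sqrt(d)
        t = Fraction(prod(range(1, 2 * k, 2)), d ** k)    # E(T^(2k)) = (2k-1)!! / d^k
        rows.append((name, k, q_a, q_w, t))
```

In every row the three moments are ordered as $E(Q(a)^{2k}) \le E(Q(w)^{2k}) \le E(T^{2k})$, with equality throughout for k = 0 and k = 1.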



Using Lemmata 6.1 and 6.2 we can prove Lemma 5.2 as follows. 
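Proof of Lemma 5.2. To prove (5) we observe that for any unit vector a, by (18) and (19),

$$E(Q(a)^4) \le E(Q(w)^4) \le E(T^4) ,$$

while

$$E(T^4) = \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi}} \exp(-\lambda^2/2) \frac{\lambda^4}{d^2} \, d\lambda = \frac{3}{d^2} .$$

To prove (4) we first observe that for any real-valued random variable U and for all h
such that $E(\exp(hU^2))$ is bounded, the Monotone Convergence Theorem (MCT) allows
us to swap the expectation with the sum and get

$$E(\exp(hU^2)) = E\Big(\sum_{k=0}^{\infty} \frac{(hU^2)^k}{k!}\Big) = \sum_{k=0}^{\infty} \frac{h^k}{k!} E(U^{2k}) .$$

So, below, we proceed as follows. Taking $h \in [0, d/2)$ makes the integral in (20)
converge, giving us (21). Thus, for such h, we can apply the MCT to get (22). Now,
applying (18) and (19) to (22) gives (23). Applying the MCT once more gives (24).

$$E(\exp(hT^2)) = \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi}} \exp(-\lambda^2/2) \exp\Big(h \frac{\lambda^2}{d}\Big) \, d\lambda \tag{20}$$
$$= \frac{1}{\sqrt{1 - 2h/d}} \tag{21}$$
$$= \sum_{k=0}^{\infty} \frac{h^k}{k!} E(T^{2k}) \tag{22}$$
$$\ge \sum_{k=0}^{\infty} \frac{h^k}{k!} E(Q(a)^{2k}) \tag{23}$$
$$= E(\exp(hQ(a)^2)) . \tag{24}$$

Thus, $E(\exp(hQ^2)) \le 1/\sqrt{1 - 2h/d}$ for $h \in [0, d/2)$, as desired.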
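
The agreement between the closed form $1/\sqrt{1 - 2h/d}$ and the moment series can be checked numerically. The sketch below (illustrative names and parameter values) sums $\sum_k \frac{h^k}{k!} E(T^{2k})$ using $E(T^{2k}) = (2k-1)!!/d^k$ for $T \stackrel{D}{=} N(0, 1/d)$, updating each term from the previous one to avoid large factorials.

```python
import math

def gaussian_even_moment(k, d):
    """E(T^(2k)) for T ~ N(0, 1/d), i.e. (2k - 1)!! / d^k."""
    return math.prod(range(1, 2 * k, 2)) / d ** k

def mgf_series(h, d, terms=200):
    """Partial sum of sum_k h^k / k! * E(T^(2k))."""
    total, term = 0.0, 1.0             # term for k = 0
    for k in range(terms):
        total += term
        # ratio of consecutive terms: (h / (k + 1)) * ((2k + 1) / d)
        term *= (h / d) * (2 * k + 1) / (k + 1)
    return total

d, h = 10, 2.0                                   # any h in [0, d/2) works
closed_form = 1 / math.sqrt(1 - 2 * h / d)
series_value = mgf_series(h, d)
```

The partial sum matches the closed form to floating-point accuracy. For h at or above d/2 the terms no longer decay and the series diverges, matching the range in Lemma 5.2.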
Before proving Lemma 6.1 we will need to prove the following lemma.

Lemma 6.3. Let $r_1, r_2$ be i.i.d. r.v. having one of the following two probability
distributions: $r_i \in \{-1, +1\}$, each value having probability 1/2, or, $r_i \in \{-\sqrt{3}, 0, +\sqrt{3}\}$ with
0 having probability 2/3 and $\pm\sqrt{3}$ being equiprobable.

For any $a, b \in \mathbb{R}$ let $c = \sqrt{(a^2 + b^2)/2}$. Then for any $M \in \mathbb{R}$ and all $k = 0, 1, \ldots$

$$E((M + a r_1 + b r_2)^{2k}) \le E((M + c r_1 + c r_2)^{2k}) .$$

Proof. We first consider the case where $r_i \in \{-1, +1\}$, each value having probability 1/2.
If $a^2 = b^2$ then $|a| = |b| = c$ and, since the $r_i$ are symmetric, the lemma holds with equality.
Otherwise, observe that

$$E((M + c r_1 + c r_2)^{2k}) - E((M + a r_1 + b r_2)^{2k}) = \frac{S_k}{4} ,$$

where

$$S_k = (M + 2c)^{2k} + 2M^{2k} + (M - 2c)^{2k} - (M + a + b)^{2k} - (M + a - b)^{2k} - (M - a + b)^{2k} - (M - a - b)^{2k} .$$

We will show that $S_k \ge 0$ for all $k \ge 0$.

Since $a^2 \ne b^2$ we can use the binomial theorem to expand every term other than $2M^{2k}$
in $S_k$ and get

$$S_k = 2M^{2k} + \sum_{i=0}^{2k} \binom{2k}{i} M^{2k-i} D_i ,$$

where

$$D_i = (2c)^i + (-2c)^i - (a + b)^i - (a - b)^i - (-a + b)^i - (-a - b)^i .$$

Observe now that for odd i, $D_i = 0$. Moreover, we claim that $D_{2j} \ge 0$ for all $j \ge 1$. To
see this claim observe that $(2a^2 + 2b^2) = (a + b)^2 + (a - b)^2$ and that for all $j \ge 1$ and
$x, y \ge 0$, $(x + y)^j \ge x^j + y^j$. Thus,

$$S_k = 2M^{2k} + \sum_{j=0}^{k} \binom{2k}{2j} M^{2(k-j)} D_{2j} = \sum_{j=1}^{k} \binom{2k}{2j} M^{2(k-j)} D_{2j} \ge 0 ,$$

where the $j = 0$ term cancels $2M^{2k}$ because $D_0 = -2$.

The proof for the case where $r_i \in \{-\sqrt{3}, 0, +\sqrt{3}\}$ is just a more cumbersome version
of the proof above, so we omit it. That proof, though, brings forward an interesting point.
If one tries to take $r_i = 0$ with probability greater than 2/3, while maintaining a range of
size 3 and variance 1, the lemma fails. In other words, 2/3 is tight in terms of how much
probability mass we can put to $r_i = 0$ and still have the current lemma hold. □
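
Lemma 6.3 can be spot-checked numerically for both distributions. The sketch below uses the illustrative values a = 1 and b = 7, for which $c = \sqrt{(a^2 + b^2)/2} = 5$, together with a few arbitrary shifts M. It is a floating-point check on a handful of cases only, so a small relative tolerance is appropriate when the two sides are compared, since they coincide for small k.

```python
import math
from itertools import product

DENSE = [(1.0, 0.5), (-1.0, 0.5)]
SPARSE = [(math.sqrt(3), 1 / 6), (0.0, 2 / 3), (-math.sqrt(3), 1 / 6)]

def even_moment(M, a, b, k, dist):
    """E((M + a*r1 + b*r2)^(2k)) for i.i.d. r1, r2 drawn from dist."""
    return sum(p1 * p2 * (M + a * v1 + b * v2) ** (2 * k)
               for (v1, p1), (v2, p2) in product(dist, repeat=2))

a, b = 1.0, 7.0                              # illustrative unbalanced pair
c = math.sqrt((a * a + b * b) / 2)           # balanced value, here exactly 5
checks = [(even_moment(M, a, b, k, dist), even_moment(M, c, c, k, dist))
          for dist in (DENSE, SPARSE)
          for M in (-2.0, 0.0, 0.5, 3.0)
          for k in range(6)]
```

Each pair in `checks` holds the unbalanced moment followed by the balanced one, and the first never exceeds the second beyond rounding error.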

Proof of Lemma 6.1. Recall that for any vector a, $Q(a) = Q_1(a) = a \cdot c_1$ where

$$c_1 = \frac{1}{\sqrt{d}}(r_{11}, \ldots, r_{d1}) .$$

If $a = (a_1, \ldots, a_d)$ is such that $a_i^2 = a_j^2$ for all i, j, then by symmetry, $Q(a)$ and
$Q(w)$ are identically distributed and the lemma holds trivially. Otherwise, we can assume
without loss of generality that $a_1^2 \ne a_2^2$ and consider the "more balanced" unit vector
$\theta = (c, c, a_3, \ldots, a_d)$, where $c = \sqrt{(a_1^2 + a_2^2)/2}$. We will prove that

$$E(Q(a)^{2k}) \le E(Q(\theta)^{2k}) . \tag{25}$$

Applying this argument repeatedly yields the lemma, as $\theta$ eventually becomes w.

To prove (25), below we first express $E(Q(a)^{2k})$ as a sum of averages over $r_{11}, r_{21}$
and then apply Lemma 6.3 to get that each term (average) in the sum is bounded by the
corresponding average for vector $\theta$. More precisely,

$$E(Q(a)^{2k}) = \frac{1}{d^k} \sum_M E((M + a_1 r_{11} + a_2 r_{21})^{2k}) \Pr\Big[\sum_{i=3}^d a_i r_{i1} = M\Big]$$
$$\le \frac{1}{d^k} \sum_M E((M + c r_{11} + c r_{21})^{2k}) \Pr\Big[\sum_{i=3}^d a_i r_{i1} = M\Big]$$
$$= E(Q(\theta)^{2k}) .$$

Proof of Lemma 6.2. Recall that $T \stackrel{D}{=} N(0, 1/d)$. We will first express T as the scaled
sum of d independent standard Normal random variables. This will allow for a direct
comparison of the terms in each of the two expectations.

Specifically, let $\{T_i\}_{i=1}^d$ be a family of i.i.d. standard Normal random variables. Then
$\sum_{i=1}^d T_i$ is a Normal random variable with variance d. Therefore,

$$T \stackrel{D}{=} \frac{1}{d} \sum_{i=1}^d T_i .$$

Recall also that $Q(w) = Q_1(w) = w \cdot c_1$ where

$$c_1 = \frac{1}{\sqrt{d}}(r_{11}, \ldots, r_{d1}) .$$

To simplify notation let us write $r_{i1} = Y_i$ and let us also drop the dependence of Q on
w. Thus,

$$Q = \frac{1}{d} \sum_{i=1}^d Y_i ,$$

where $\{Y_i\}_{i=1}^d$ are i.i.d. r.v. having one of the following two distributions: $Y_i \in \{-1, +1\}$,
each value having probability 1/2, or $Y_i \in \{-\sqrt{3}, 0, +\sqrt{3}\}$ with 0 having probability 2/3
and $\pm\sqrt{3}$ being equiprobable.

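We are now ready to compare $E(Q^{2k})$ with $E(T^{2k})$. We first observe that for every
$k = 0, 1, \ldots$

$$E(T^{2k}) = \frac{1}{d^{2k}} \sum_{i_1=1}^d \cdots \sum_{i_{2k}=1}^d E(T_{i_1} \cdots T_{i_{2k}}) , \text{ and}$$
$$E(Q^{2k}) = \frac{1}{d^{2k}} \sum_{i_1=1}^d \cdots \sum_{i_{2k}=1}^d E(Y_{i_1} \cdots Y_{i_{2k}}) .$$

To prove the lemma we will show that for every value assignment to the indices $i_1, \ldots, i_{2k}$,

$$E(Y_{i_1} \cdots Y_{i_{2k}}) \le E(T_{i_1} \cdots T_{i_{2k}}) . \tag{26}$$

Let $V = (v_1, v_2, \ldots, v_{2k})$ be the value assignment considered. For $i \in \{1, \ldots, d\}$, let
$c_V(i)$ be the number of times that i appears in V. Observe that if for some i, $c_V(i)$ is
odd then both expectations appearing in (26) are 0, since both $\{Y_i\}_{i=1}^d$ and $\{T_i\}_{i=1}^d$ are
independent families of symmetric random variables, so that all their odd moments vanish
(in particular $E(Y_i) = E(T_i) = 0$ for all i). Thus, we can assume that there
exists a set $\{j_1, j_2, \ldots, j_p\}$ of indices and corresponding values $\ell_1, \ell_2, \ldots, \ell_p$ such that

$$E(Y_{i_1} \cdots Y_{i_{2k}}) = E(Y_{j_1}^{2\ell_1} Y_{j_2}^{2\ell_2} \cdots Y_{j_p}^{2\ell_p}) , \text{ and}$$
$$E(T_{i_1} \cdots T_{i_{2k}}) = E(T_{j_1}^{2\ell_1} T_{j_2}^{2\ell_2} \cdots T_{j_p}^{2\ell_p}) .$$

Note now that since the indices $j_1, j_2, \ldots, j_p$ are distinct, $\{Y_{j_t}\}_{t=1}^p$ and $\{T_{j_t}\}_{t=1}^p$ are
families of i.i.d. r.v. Therefore,

$$E(Y_{i_1} \cdots Y_{i_{2k}}) = E(Y_{j_1}^{2\ell_1}) \times \cdots \times E(Y_{j_p}^{2\ell_p}) , \text{ and}$$
$$E(T_{i_1} \cdots T_{i_{2k}}) = E(T_{j_1}^{2\ell_1}) \times \cdots \times E(T_{j_p}^{2\ell_p}) .$$

So, without loss of generality, in order to prove (26) it suffices to prove that for every
$\ell = 0, 1, \ldots$

$$E(Y_1^{2\ell}) \le E(T_1^{2\ell}) . \tag{27}$$

This, though, is completely trivial. First recall the well-known fact that the $(2\ell)$th moment
of N(0, 1) is $(2\ell - 1)!! = (2\ell)!/(\ell!\, 2^\ell) \ge 1$. Now:

- If $Y_1 \in \{-1, +1\}$ then $E(Y_1^{2\ell}) = 1$, for all $\ell \ge 0$.

- If $Y_1 \in \{-\sqrt{3}, 0, +\sqrt{3}\}$ then $E(Y_1^{2\ell}) = 3^{\ell - 1} \le (2\ell)!/(\ell!\, 2^\ell)$ for $\ell \ge 1$, where the last
inequality follows by an easy induction.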
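
Inequality (27) is easy to tabulate for the first few values of $\ell$. The sketch below uses illustrative function names and treats $\ell = 0$ separately for the three-valued distribution, where the moment is 1 rather than $3^{\ell-1}$.

```python
from math import factorial, prod

def gaussian_moment(l):
    """(2l)-th moment of N(0, 1): (2l - 1)!! = 1 * 3 * ... * (2l - 1)."""
    return prod(range(1, 2 * l, 2))

def dense_moment(l):
    """(2l)-th moment of a variable uniform on {-1, +1}."""
    return 1

def sparse_moment(l):
    """(2l)-th moment of sqrt(3) * {+1, 0, -1} with probabilities 1/6, 2/3, 1/6."""
    return 1 if l == 0 else 3 ** (l - 1)

# Rows of (l, dense moment, sparse moment, Gaussian moment) for l = 0, ..., 7.
table = [(l, dense_moment(l), sparse_moment(l), gaussian_moment(l)) for l in range(8)]

# The double factorial agrees with the closed form (2l)! / (l! 2^l).
closed_forms = [factorial(2 * l) // (factorial(l) * 2 ** l) for l in range(8)]
```

For $\ell = 3$, for instance, the three moments are 1, 9 and 15, and the gap to the Gaussian moment widens as $\ell$ grows.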

It is worth pointing out that, along with Lemma 6.3, these are the only two points where
we used any properties of the distributions for the $r_{ij}$ (here called $Y_i$) other than them
having zero mean and unit variance. □

Finally, we note that 

• Since $E(Y_1^{2\ell}) < E(T_1^{2\ell})$ for certain $\ell$, we see that for each fixed d, both inequalities
in Lemma 5.2 are actually strict, yielding slightly better tail bounds for S and a
correspondingly better bound for $k_0$.

• By using Jensen's inequality one can get a direct bound for $E(Q^{2k})$ when $Y_i \in \{-1, +1\}$,
i.e., without comparing it to $E(T^{2k})$. That simplifies the proof for that case and
shows that, in fact, taking $Y_i \in \{-1, +1\}$ is the minimizer of $E(\exp(hQ^2))$ for all h.